Image captioning models require a high level of generalization ability to describe the contents of various images in words. Most existing approaches treat image-caption pairs equally during training, without considering differences in their learning difficulty. Several image captioning approaches introduce curriculum learning methods that present training data at increasing levels of difficulty. However, their difficulty measures either rely on domain-specific features or require prior model training. In this paper, we propose a simple yet efficient difficulty measure for image captioning, using the cross-modal similarity computed by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our approach achieves performance superior to the baselines with competitive convergence speed, without requiring heuristics or incurring additional training costs. Moreover, higher performance on difficult examples and on unseen data further demonstrates the generalization ability of our method.
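A minimal sketch of the cross-modal difficulty measure described above, assuming the pretrained vision-language model is CLIP (the abstract only specifies "a pretrained vision-language model"); the checkpoint name, the 1 − similarity scoring, and the sorting step are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: CLIP stands in for the paper's pretrained vision-language model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def difficulty(image_path: str, caption: str) -> float:
    """Lower cross-modal similarity -> harder image-caption pair."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    sim = torch.cosine_similarity(img_emb, txt_emb).item()
    return 1.0 - sim  # higher value = more difficult

# Curriculum ordering: present easy (high-similarity) pairs first.
# pairs = [("img1.jpg", "a dog on the grass"), ...]
# ordered = sorted(pairs, key=lambda p: difficulty(*p))
```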
Sentence embedding methods have many successful applications. However, it is not well understood which properties are captured in the resulting sentence embeddings, depending on the supervision signal. In this paper, we focus on two types of sentence embedding methods with similar architectures and tasks: one fine-tunes a pretrained language model on the natural language inference task, and the other fine-tunes a pretrained language model on the task of predicting a word from its definition sentence; we investigate their properties. Specifically, we compare their performance on semantic textual similarity (STS) tasks, using STS datasets partitioned from two perspectives: 1) the sentence source and 2) the superficial similarity of the sentence pairs, and we also compare their performance on downstream and probing tasks. Furthermore, we attempt to combine the two methods and demonstrate that the combination performs substantially better than either method alone on unsupervised STS tasks and downstream tasks.
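A minimal sketch of the STS comparison and one simple way of combining the two embedding types, assuming both methods are available as SentenceTransformer checkpoints; the model names below are placeholders, not the exact models used in the paper, and concatenation is just one possible combination strategy.

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

# Placeholders for the NLI-tuned and definition-sentence-tuned encoders.
nli_model = SentenceTransformer("nli-finetuned-model")          # hypothetical name
defsent_model = SentenceTransformer("defsent-finetuned-model")  # hypothetical name

def sts_score(model, pairs, gold):
    """Spearman correlation between cosine similarities and gold STS scores."""
    a = model.encode([s1 for s1, _ in pairs], convert_to_numpy=True)
    b = model.encode([s2 for _, s2 in pairs], convert_to_numpy=True)
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return spearmanr(cos, gold).correlation

def combined_encode(sentences):
    """One simple combination: concatenate the two embeddings."""
    e1 = nli_model.encode(sentences, convert_to_numpy=True)
    e2 = defsent_model.encode(sentences, convert_to_numpy=True)
    return np.concatenate([e1, e2], axis=1)
```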
In this paper, we consider the task of clustering a set of individual time series while modeling each cluster, i.e., model-based time series clustering. This task requires a parametric model flexible enough to describe the dynamics within the individual time series. To address this, we propose a model-based time series clustering method using mixtures of linear Gaussian state space models, which offer high flexibility. The proposed method estimates the model parameters with a new expectation-maximization algorithm for the mixture model and determines the number of clusters with the Bayesian information criterion. Experiments on simulated datasets demonstrate the effectiveness of the method in clustering, parameter estimation, and model selection. The method is then applied to a real dataset on which previously proposed time series clustering methods showed low accuracy; the results indicate that our method yields more accurate clustering than the previous methods.
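A minimal sketch of the model-selection loop described above: fit the mixture of linear Gaussian state space models by EM for each candidate number of clusters and keep the one with the lowest Bayesian information criterion. Here `fit_mixture_lgssm` is a hypothetical stand-in for the paper's EM procedure, not a real library call, and using the number of series as the BIC sample size is an assumption.

```python
import numpy as np

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """BIC = -2 log L + k log n (lower is better)."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

def select_num_clusters(series, candidate_ks, fit_mixture_lgssm):
    """Fit one mixture per candidate K and return the BIC-optimal model."""
    best = None
    for k in candidate_ks:
        # Hypothetical EM fit: returns an object exposing the attained
        # log-likelihood and the number of free parameters.
        model = fit_mixture_lgssm(series, n_clusters=k)
        score = bic(model.log_likelihood, model.n_params, len(series))
        if best is None or score < best[0]:
            best = (score, k, model)
    return best  # (best BIC, chosen number of clusters, fitted mixture)
```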